Introduction to data visualisation with R & ggplot2

Felix - CorrelAidX Austria

CorrelAid

We are a network of over 2,000 volunteers who want to improve the world through data science

  • 2,400 data analysts in our interdisciplinary and diverse network

  • 57 completed projects with well-known NPOs since 2015 and more than 60 workshops

  • 13 local chapters in Germany, the Netherlands, France, Switzerland, and Austria

Join us for data4good projects, workshops and exchange with a great network

https://www.correlaid.org & austria@correlaid.org

First steps in R

# the first time you use the packages, you need to install them
#install.packages("tidyverse")
#install.packages("palmerpenguins")

# load packages
library(tidyverse) # also loads ggplot2
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(palmerpenguins) # our data for today
Warning: package 'palmerpenguins' was built under R version 4.2.3

palmerpenguins data

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
head(penguins, n=3)
# A tibble: 3 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Syntax of ggplot2

data

To create a visualisation, the first line is always:

ggplot(data = <NAME OF DATAFRAME>)

This line tells R which data object contains the variables you want to plot. It creates the first layer of the plot. Since no variables are specified yet, this first layer is an empty canvas.

ggplot(data = penguins)
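
To see why the canvas is empty, it can help to store the plot in an object and look inside it. The following is a minimal sketch; the object name p is only chosen for illustration. The plot already carries the data, but no variables have been mapped to aesthetics yet.

```r
library(ggplot2)
library(palmerpenguins)

# Store the plot in an object instead of printing it straight away
# (the name p is only an illustration)
p <- ggplot(data = penguins)

# The plot object holds the full data set ...
n_rows <- nrow(p$data)
n_rows

# ... but no variables have been mapped to aesthetics so far
n_mappings <- length(p$mapping)
n_mappings
```

Printing p gives the same empty canvas as calling ggplot(data = penguins) directly.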

geom_function

Every additional layer is added with +

The geom_function is the mandatory second line and defines the type of plot.

With geom_bar() we create a bar plot. By specifying the variable species inside mapping = aes(x = species), we tell ggplot2 to plot the number of observations in each category of that variable.

The x- and y-axes are aesthetics and need to be specified inside aes().

ggplot(data = penguins) +
  geom_bar(mapping = aes(x = species))
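
The bar heights are simply the number of rows per species. As a rough cross-check, the same numbers can be computed as a table with count() from dplyr (loaded with the tidyverse). The object name species_counts is only chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Number of observations (rows) per species;
# these are the values that geom_bar() draws as bar heights
species_counts <- count(penguins, species)
species_counts
```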

multiple variables

A second variable can be added to the plot through an additional aesthetic inside aes(). If the variables are defined correctly as numeric, factor or character, ggplot2 chooses the right scale and axis names.

By adding fill = island inside aes(), we break down the number of penguins of each species by the island where they live.

ggplot(data = penguins) +
  geom_bar(mapping = aes(x = species, fill = island))

ggplot(data = penguins) +
  geom_bar(mapping = aes(x = species, fill = island),
           position = "dodge")

ggplot(data = penguins) +
  geom_bar(mapping = aes(x = species, fill = island),
           position = "fill")
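
The three plots above differ only in the position argument: by default the islands are stacked on top of each other, "dodge" places them side by side, and "fill" scales every bar to the same height so that it shows shares instead of counts. The following sketch computes those shares as a table; the names island_share and share are only chosen for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Count penguins per species and island, then turn the counts
# into shares within each species (what position = "fill" displays)
island_share <- penguins %>%
  count(species, island) %>%
  group_by(species) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

island_share
```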

plot distributions

So far we have used two categorical variables, but you can also plot the distribution of numeric variables.

Instead of geom_bar() we use geom_boxplot() to illustrate the distribution of the penguins' body_mass_g.

ggplot(data = penguins) +
  geom_boxplot(mapping = aes(y = body_mass_g))

ggplot(data = penguins) +
  geom_boxplot(mapping = aes(y = body_mass_g, x = species))
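
The box in a boxplot spans the lower to the upper quartile, and the line inside marks the median. As a sketch, the same values can be computed per species with group_by() and summarise(). The column names are only chosen for illustration, and na.rm = TRUE is needed because body_mass_g contains missing values.

```r
library(dplyr)
library(palmerpenguins)

# Quartiles and median of body mass per species;
# missing values are removed before computing each statistic
mass_summary <- penguins %>%
  group_by(species) %>%
  summarise(
    lower_quartile = quantile(body_mass_g, 0.25, na.rm = TRUE),
    median         = median(body_mass_g, na.rm = TRUE),
    upper_quartile = quantile(body_mass_g, 0.75, na.rm = TRUE)
  )

mass_summary
```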

plot multiple numeric variables

Other geom_functions such as geom_point() need both an x and a y variable, here two numeric ones.

ggplot(data = penguins) +
  geom_point(mapping = aes(y = body_mass_g, x = flipper_length_mm)) +
  stat_smooth(mapping = aes(y = body_mass_g, x = flipper_length_mm), method = "lm", geom = "smooth")

ggplot(data = penguins, mapping = aes(y = body_mass_g, x = flipper_length_mm)) +
  geom_point() +
  stat_smooth(method = "lm", geom = "smooth")

ggplot(data = penguins, mapping = aes(y = body_mass_g, x = flipper_length_mm, colour = species)) +
  geom_point() +
  stat_smooth(method = "lm", geom = "smooth")
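
With method = "lm", stat_smooth() draws the line of a linear regression of y on x. Without colour = species there is one line for all penguins; with it, the data are grouped and each species gets its own line. The sketch below fits the corresponding models directly with lm(); the names fit_all and fits_by_species are only chosen for illustration.

```r
library(palmerpenguins)

# One linear model for all penguins together
# (rows with missing values are dropped by lm() by default)
fit_all <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
coef(fit_all)

# One linear model per species, matching the grouped plot
fits_by_species <- lapply(
  split(penguins, penguins$species),
  function(d) coef(lm(body_mass_g ~ flipper_length_mm, data = d))
)
fits_by_species
```

The slope of each model says by how many grams the body mass changes, on average, per additional millimetre of flipper length.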

adding labels

Every part of the plot can be changed with the right function. With labs() you can label the title, axes and legend, and also add notes.

ggplot(data = penguins, mapping = aes(y = body_mass_g, x = flipper_length_mm, colour = species)) +
  geom_point() +
  stat_smooth(method = "lm", geom = "smooth") +
  labs(title = "Do heavier penguins have longer flippers?",
       x = "length of flipper in mm",
       y = "weight in g",
       colour = "penguin species",
       caption  = "Data from palmerpenguins package")
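
Because every layer is added with +, a plot can also be stored in an object and changed later. The following is a small sketch of that pattern; the object names base_plot and labelled_plot and the label texts are only chosen for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Build the plot once and keep it in an object
base_plot <- ggplot(data = penguins,
                    mapping = aes(y = body_mass_g, x = flipper_length_mm)) +
  geom_point()

# Add labels afterwards; base_plot itself stays unchanged
labelled_plot <- base_plot +
  labs(title = "Body mass and flipper length",
       x = "flipper length in mm",
       y = "body mass in g")

# Typing the object name draws the plot
labelled_plot
```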

now it is your turn